Remove unnecessary calls to CodecPool.returnCompressor/returnDecompressor to avoid race conditions #103

Merged
merged 5 commits into twitter:master on Dec 5, 2014

Conversation

themodernlife
Contributor


The input/output stream implementations erroneously return the (de)compressors to the CodecPool on close, even though they didn't get the (de)compressors from the pool. The user who obtains a (de)compressor is responsible for returning it, and if both the user and the stream return it, the same instance ends up in the pool twice, which leads to a race condition.

This fixes #91 and #94.
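
To make the race concrete, here is a hypothetical caller that follows the documented CodecPool get/return contract (API names are from standard Hadoop; the scenario is an illustration, not code from this patch):

```java
import java.io.InputStream;
import org.apache.hadoop.io.compress.CodecPool;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.Decompressor;

void readCompressed(CompressionCodec codec, InputStream fileIn) throws Exception {
  Decompressor decompressor = CodecPool.getDecompressor(codec);
  InputStream in = codec.createInputStream(fileIn, decompressor);
  // ... read from 'in' ...
  in.close();                                 // bug: the stream itself returned
                                              // the decompressor to the pool here
  CodecPool.returnDecompressor(decompressor); // correct per the contract, but the
                                              // same instance is now pooled twice,
                                              // so two threads can check it out
                                              // concurrently
}
```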

There was some concern that this might break some code in the wild.

FWIW, I did a quick search on GitHub to see how people are using this library, and there really wasn't much to speak of outside of forks and Hadoop code. The code I did find uses CodecPool properly (getting and returning), so this patch wouldn't be an issue. It should also work cleanly with any Hadoop setup.

The only way I can see a user running into a problem is if they get the compressor/decompressor from the CodecPool and then don't return it. In that case they are using CodecPool incorrectly, which I hope is not common enough to justify keeping this fix out.

My main motivation is that this makes it possible to use Spark safely with LZO (see #91).

Hope you guys can incorporate it one way or another! Maybe a 0.5.x release (to clearly signal any potential change of behavior)?

Remove unnecessary calls to CodecPool.returnCompressor/returnDecompressor to avoid race conditions

The input/output stream implementations erroneously add the (de)compressors back to the CodecPool on close.
The user who creates the (de)compressor is responsible for doing this, and if they return a decompressor
as well, the same instance will be in the pool twice.
@sjlee
Collaborator

sjlee commented Dec 3, 2014

Thanks for the PR @themodernlife. Unfortunately, I'm afraid the problem is more complicated.

Most use cases go through LzopCodec to create the input stream. However, LzopCodec itself has two conflicting ways of managing the decompressor instances. For example, one can call createInputStream(InputStream, Decompressor) to obtain the input stream (like LineRecordReader does). In this case, in principle it is the caller that's responsible for returning the decompressor to the pool.

On the other hand, LzopCodec also has createInputStream(InputStream), in which case it is LzopCodec itself that obtains the decompressor (see line 113). In that case, LzopCodec relies on LzopInputStream.close() for returning the decompressor. There is no obvious lifecycle method that you could use to have LzopCodec return the decompressor.

So if we removed the call to return the decompressor within LzopInputStream.close(), we would in fact leak the decompressors for all use cases that go through LzopCodec.createInputStream(InputStream). I know for a fact that there are tons of use cases for that.
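
In code, the two conflicting paths look roughly like this (a sketch; variable names are assumptions):

```java
// Path 1: the caller borrows the decompressor and must return it
// (this is what LineRecordReader does).
Decompressor decompressor = CodecPool.getDecompressor(codec);
InputStream in1 = codec.createInputStream(raw, decompressor);
// ... later: in1.close(); CodecPool.returnDecompressor(decompressor);

// Path 2: LzopCodec borrows a decompressor internally, and the only
// hook it has for giving it back is LzopInputStream.close().
InputStream in2 = codec.createInputStream(raw);
// ... later: in2.close(); // must return the internal decompressor, or it leaks
```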

@rangadi
Contributor

rangadi commented Dec 3, 2014

When LzopCodec creates the decompressor itself, it could return a filtered input stream that returns the decompressor to the pool when the stream is closed.

That is probably all that is required.
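
A minimal sketch of that suggestion, assuming a hypothetical wrapper class name (not necessarily the code that was ultimately merged):

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.hadoop.io.compress.CodecPool;
import org.apache.hadoop.io.compress.Decompressor;

// Returns the internally borrowed decompressor to the pool exactly once,
// when the stream that borrowed it is closed.
class DecompressorReturningStream extends FilterInputStream {
  private Decompressor decompressor; // nulled after return

  DecompressorReturningStream(InputStream in, Decompressor decompressor) {
    super(in);
    this.decompressor = decompressor;
  }

  @Override
  public void close() throws IOException {
    try {
      super.close();
    } finally {
      if (decompressor != null) {
        CodecPool.returnDecompressor(decompressor);
        decompressor = null; // guard against double-return on repeated close()
      }
    }
  }
}
```

The same pattern would apply on the write side, with FilterOutputStream and CodecPool.returnCompressor.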

@sjlee
Collaborator

sjlee commented Dec 3, 2014

That sounds like a good approach. @themodernlife, would you like to update your PR to do that, both for the compressor and the decompressor?

@themodernlife
Contributor Author

PR updated. Good idea @rangadi!

```java
}

@Override
public int read() throws IOException {
```
Contributor
You should implement read(byte b[], int off, int len); otherwise, reads will be very slow.
Actually, even better is to make it extend FilterInputStream and override only close().
Same for the OutputStream.
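
For reference, if the wrapper stayed a plain InputStream subclass, the bulk-read override would look like this (assuming a field named in holding the wrapped stream); extending FilterInputStream avoids the issue entirely, since it already forwards read(byte[], int, int) to the wrapped stream:

```java
@Override
public int read(byte[] b, int off, int len) throws IOException {
  // Forward bulk reads to the wrapped stream instead of inheriting
  // InputStream's byte-at-a-time default loop.
  return in.read(b, off, len);
}
```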

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.*;
```
Contributor

Minor: I don't think the Hadoop codebase encourages * imports.

@rangadi
Contributor

rangadi commented Dec 4, 2014

+1. Thanks for the updates.

@sjlee
Collaborator

sjlee commented Dec 5, 2014

LGTM. Thanks @themodernlife for your contribution! I'll merge it shortly.

sjlee added a commit that referenced this pull request Dec 5, 2014
Remove unnecessary calls to CodecPool.returnCompressor/returnDecompressor to avoid race conditions
@sjlee sjlee merged commit d62701d into twitter:master Dec 5, 2014